12 research outputs found

    Building ontologies from folksonomies and linked data: Data structures and Algorithms

    We present the data structures and algorithms used in an approach for building domain ontologies from folksonomies and linked data. In this approach, we extract domain terms from folksonomies and enrich them with semantic information from the Linked Open Data cloud. As a result, we obtain a domain ontology that combines the emergent knowledge of social tagging systems with formal knowledge from ontologies.

    Social tags and linked data for ontology development: a case study in the financial domain

    We describe a domain ontology development approach that extracts domain terms from folksonomies and enriches them with data and vocabularies from the Linked Open Data cloud. As a result, we obtain lightweight domain ontologies that combine the emergent knowledge of social tagging systems with formal knowledge from ontologies. To illustrate the feasibility of our approach, we have produced an ontology in the financial domain from tags available in Delicious, using DBpedia, OpenCyc and UMBEL as additional knowledge sources.

    In the pursuit of a semantic similarity metric based on UMLS annotations for articles in PubMed Central

    Motivation: Although publishers provide full-text articles in electronic formats, it remains a challenge to find related work beyond the title and abstract context. Identifying related articles based on their abstracts is a good starting point; the process is straightforward and consumes fewer resources than full-text similarity would require. However, further analyses may require in-depth understanding of the full content, and two articles with highly related abstracts can differ substantially in their full content. This manuscript addresses two main issues: how similarity differs when considering title-and-abstract versus full text, and which semantic similarity metric provides better results on full-text articles.
    Methods: We benchmarked three similarity metrics (BM25, PMRA, and cosine) to determine which performs best when using concept-based annotations on full-text documents. We also evaluated variations in similarity values based on title-and-abstract against those relying on full text. Our test dataset comprises the Genomics track article collection from the 2005 Text Retrieval Conference. Initially, we used entity recognition software to semantically annotate titles and abstracts, as well as full text, with concepts defined in the Unified Medical Language System (UMLS®). For each article, we created a document profile, i.e., a set of identified concepts with their term frequency and inverse document frequency; we then applied the similarity metrics to those document profiles. We considered correlation, precision, recall, and F1 to determine which similarity metric performs best with concept-based annotations. For those full-text articles available in PubMed Central Open Access (PMC-OA), we also performed dispersion analyses to understand how similarity varies when considering full-text articles.
    Results: We found that the PubMed Related Articles similarity metric is the most suitable for full-text articles annotated with UMLS concepts. For similarity values above 0.8, all metrics exhibited an F1 around 0.2 and a recall around 0.1; BM25 showed the highest precision, close to 1; in all cases the concept-based metrics performed better than the word-stem-based one. Our experiments show that similarity values vary when considering only title-and-abstract versus full-text similarity. Therefore, analyses based on full text become useful when research requires going beyond title and abstract, particularly regarding connectivity across articles.
    Availability: Visualization available at ljgarcia.github.io/semsim.benchmark/, data available at http://dx.doi.org/10.5281/zenodo.13323.
    The authors acknowledge the support from the members of the Temporal Knowledge Bases Group at Universitat Jaume I. Funding: LJGC and AGC are both self-funded; RB is funded by the “Ministerio de Economía y Competitividad” with contract number TIN2011-24147.
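
    The document-profile step described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only the cosine metric (not BM25 or PMRA), the function names `build_profiles` and `cosine` are hypothetical, the concept identifiers are placeholders rather than real UMLS concepts, and the smoothed IDF formula is an assumption, since the abstract does not state which weighting was used.

```python
import math
from collections import Counter


def build_profiles(docs):
    """Turn {doc_id: [concept_id, ...]} into TF-IDF weighted profiles.

    Assumption: IDF is smoothed as log(1 + N / df) so that a concept
    occurring in every document still gets a non-zero weight.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each concept appears.
    doc_freq = Counter()
    for concepts in docs.values():
        doc_freq.update(set(concepts))

    profiles = {}
    for doc_id, concepts in docs.items():
        term_freq = Counter(concepts)
        profiles[doc_id] = {
            concept: count * math.log(1 + n_docs / doc_freq[concept])
            for concept, count in term_freq.items()
        }
    return profiles


def cosine(profile_a, profile_b):
    """Cosine similarity between two sparse concept-weight profiles."""
    dot = sum(w * profile_b[c] for c, w in profile_a.items() if c in profile_b)
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is similar to nothing
    return dot / (norm_a * norm_b)


# Placeholder annotations: each list stands for the concepts recognized
# in one article (title-and-abstract or full text).
annotations = {
    "article_1": ["concept_gene", "concept_protein", "concept_protein"],
    "article_2": ["concept_gene", "concept_protein", "concept_disease"],
    "article_3": ["concept_drug"],
}
profiles = build_profiles(annotations)
print(cosine(profiles["article_1"], profiles["article_2"]))
print(cosine(profiles["article_1"], profiles["article_3"]))
```

    Running the same comparison once on profiles built from title-and-abstract annotations and once on full-text annotations would show how the similarity value shifts between the two views, which is the variation the abstract reports.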

    Biotea-Biolinks: A semantic infrastructure for exploring and analyzing scientific publications

    Background: Despite the dissemination of scientific publications, most of their information remains locked up in discrete documents that are not always interconnected or machine-readable. This, together with the continuous growth of scientific literature, makes simple tasks such as categorizing and finding similar documents difficult.
    Results: Biotea provides both a semantic model and a workflow to represent metadata, references and contents from publications, adding on top an enriched level where biomedical expressions are semantically annotated (i.e., identified, extracted and associated with ontological concepts). We have applied our model to the full-text, open-access subset of PubMed Central. We take advantage of this semantic infrastructure by applying Biolinks principles. Biolinks proposes a reclassification of the Unified Medical Language System semantic groups, which is then used to semantically characterize and compare publications.
    Conclusions: Biotea and Biolinks embed publications in the Linked Open Data cloud, facilitating interoperability and queryability, and contributing to literature-based knowledge discovery.